“The first law of geography: Everything is related to everything else, but near things are more related than distant things.” Waldo R. Tobler (Tobler 1970)1
# Libraries
library(tidyverse)
library(spdep)
library(rgdal)
library(sf)
library(sp)
library(rgeos)
library(terra)
library(DT)
The necessary data for this type of analysis are:
# Read municipality data
tn <- readOGR("../data/aggregated_data_per_municipality", encoding="UTF-8")
## OGR data source with driver: ESRI Shapefile
## Source: "G:\Il mio Drive\2nd Year\Geospatial\TrentinoSchools\data\aggregated_data_per_municipality", layer: "aggregated_data_per_municipality"
## with 166 features
## It has 16 fields
## Integer64 fields read as strings: Pop under Pop_mat Pop_ele Pop_med Pop_sup Popolazion
# Setting the CRS
tn <- spTransform(tn, CRS("+init=epsg:4326"))
# Function to replace NAs with 0
na.zero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
# Replace NAs about municipalities' number of schools with 0s
tn@data$Scuole.tot = na.zero(tn@data$Scuole.tot)
# Plot the municipalities in Trentino
par(mar=c(0,0,0,0))
plot(tn, axes = F, border="grey")
It is worth noticing from the map of school points that many of them overlap, especially in the Adige’s Valley and most populated areas, such as Trento, Rovereto and Riva del Garda.
# Import shapefile as SpatialPointsDataFrame
schools <- readOGR("../data/Trentino/schools/schools.shp",verbose = FALSE)
# Setting the CRS
schools <- spTransform(schools, CRS("+init=epsg:4326"))
# Plot schools over Trentino's map
par(mar=c(0,0,0,0))
plot(tn, border = "grey", axes = F)
points(schools@coords, col = "#f27059", cex = 1, pch = 1)
Before starting with the global analysis of Trentino municipalities, it is necessary to select some representative points for each municipality as unique reference to the spatial coordinate.
par(mar=c(0,0,0,0))
plot(tn, border="grey")
points(coordinates(tn),
col="red",
bg = "#EF798A",
pch = 21,
lwd = 1.5)
Commonly, the centroid is a good choice, but still some problems can emerge and the centroid may not fall inside the boundaries of the territory. In this particular case, this may happen with multipolygons shape, i.e. those municipalities represented by a multiple number of polygons that do not share a boundary. Some examples are the municipalities of Tione di Trento, Ronzone, Stenico, Calliano, Pellizzano and Riva del Garda. Some municipalities, as Luserna, find their centroid in another territory because of their shape.
The plot depicts all those municipalities whose centroid is outside their actual territory. As can be seen, most of them have little zones outside the main one, while others (e.g. Predaia and Luserna) just have awkward shapes.
tn_coords = coordinates(tn)
tn@data$centroid = tn_coords
colorize = c()
for(i in 1:166){
colorize = append(colorize, point.in.polygon(tn$centroid[i,1],
tn$centroid[i,2],
tn@polygons[i][[1]]@Polygons[[1]]@coords[,1],
tn@polygons[i][[1]]@Polygons[[1]]@coords[,2]))
}
par(mar = c(0,0,0,0))
plot(tn, col = ifelse(colorize, "white","lightgrey"), border="grey")
text(tn_coords, labels = ifelse(colorize, "" ,tn@data$Comune), cex=0.6)
Instead of recurring to coordinates(), we could use gCentroid to obtain an alternative version of centroids. For most of the cases, these two versions coincide, but for those municipalities with multiple polygons points may differ (e.g. Soraga di Fassa in the upper right part). Since it computes a sort of mean point, making the centroid being part of other territories, the rest of the notebook will relate to the previous version (i.e. coordinates), despite inaccurate in some cases.
par(mar=c(0,0,0,0))
trueCentroids = gCentroid(tn,byid=TRUE)
plot(tn, border="grey")
points(coordinates(tn),pch=1, col="blue")
points(trueCentroids,pch=2, col="red")
⚠️ Note that, instead, school dataset contains points and not polygons, therefore no problem occurs with the centroid computation. Also, since many schools have same coordinates because in the same building, the centroids will only consider unique points, reducing the dataset from 724 to 599 unique points.
Since there are various definitions of neighbourhood, we will try to explore three of them in the following sections, starting from K-nearest neighbour.
Since there is no way to choose a specific value for \(k\), we can iterate over a customized range, let’s say from 1 to 20 neighbours (i.e. schools). The following code generates an image for each value of \(k\).
# Saving schools coordinates
school_coords = coordinates(schools)
# Save frame per frame
for(i in 1:20){
png(paste0("../viz/knn/",i,".png"),res=300, width=1000, height=1000)
k <- knn2nb(knearneigh(school_coords, k = i, longlat=T))
par(mar = c(0,0,0,0))
plot(tn, border = "grey80", axis = tn, lwd=0.5)
plot(k, school_coords, lwd=.6, col=alpha("#F27059",alpha=0.5),
cex = .3, add=TRUE, points=FALSE)
dev.off()
}
⚠️ Note that through the function saveGIF()4 it would be possible to save frames as animation, lowering the resolution of the image obtained. Saving each frame will also allow users to select a customized value for \(k\) and look at how the map changes inside the website.
KNN on Trentino Schools
As \(k\) is increased, each school finds more and more neighbours, approaching also further points. If we focus on the first frame with \(k=1\), we may notice that some schools are isolated, since their closest neighbour requires long lines to be reached. An example is the municipality of Vermiglio, in the upper left part of Trentino, in the western boundary. There are in fact \(3\) schools in Vermiglio and one of them is near the boundary, far from other schools. A similar situation happens to Rabbi and Malé, to Luserna and Lavarone, but still Vermiglio is the municipality with the longest distance between schools within the local territory.
k <- knn2nb(knearneigh(school_coords, k = 1, longlat=T))
par(mar = c(0,0,0,0))
plot(tn, border = "grey80", axis = tn, lwd=0.5)
plot(k, school_coords, lwd=.6,
col=alpha("#F27059",alpha=0.5),
cex = .3, add=TRUE, points=FALSE)
text(tn_coords, labels = ifelse(tn@data$Comune %in% c("Vermiglio","Rabbi","Malé","Luserna","Lavarone","Predaia"),tn@data$Comune, ""), cex=0.6)
On the other hand, at the extreme opposite, with \(k=20\), we obtain the network of schools in Trentino, looking at the \(20\) closest schools around each point. Trento and Rovereto are the most intertwined zones, while green areas as Adamello-Brenta Natural Park (west boundary) and Valsugana (empty zone in the right part of the map) lack in the number of schools. In fact, as can be seen by the length of lines, distances between schools tend to increase by moving from the Adige Valley to the east and west boundary.
KNN with k=20
If instead we focus on the municipalities, we can limit to a lower \(k\), since we talk about polygons with other territories at their boundary, while some schools points may result isolated with low values of \(k\). By looking at \(k=5\) plot, we may notice how all municipalities are linked to each other, despite it does not seem so in the territory above Borgo Valsugana, where Dolomites can be found.
knn1 = knn2nb(knearneigh(tn_coords, k = 1, longlat=T))
knn2 = knn2nb(knearneigh(tn_coords, k = 2, longlat=T))
knn3 = knn2nb(knearneigh(tn_coords, k = 3, longlat=T))
knn4 = knn2nb(knearneigh(tn_coords, k = 4, longlat=T))
knn5 = knn2nb(knearneigh(tn_coords, k = 5, longlat=T))
par(mar=c(0,0,0,0))
plot(tn, border = "grey80", axis = tn, lwd=0.5)
plot(knn5, tn_coords, lwd=.6, col=alpha("#F27059",alpha=0.5),
cex = .3, add=TRUE, points=FALSE)
Our aim in this section will be to find the minimum threshold distance which allows all regions/points to have at least one neighbour. By setting \(k=1\) in the k-nearest neighbour, we can first compute the nearest neighbour to each school and the relative distance and then get the maximum distance among them.
knn1<- knn2nb(knearneigh(school_coords,
k=1,
longlat=T))
all.linked <- max(unlist(nbdists(knn1,
school_coords,
longlat=T)))
all.linked
## [1] 9.182375
According to the results, all schools have a neighbour at at least 9.1823746 km. This implies that the cut-off distance has to be greater than it. However, notice from the following plot the distribution of school distances: the majority of them is near \(0\)km, following a long tail distribution. This may happen in big cities, as Trento and Rovereto, where there are a lot of schools and the minimum distance between them lowers.
distances = unlist(nbdists(knn1,school_coords,longlat=T))
ggplot()+
geom_histogram(aes(x=distances), fill='#F27059', bins=50)+
labs(title = "Distribution of distances between schools")+
theme_minimal()
knn1<- knn2nb(knearneigh(tn_coords,
k=1,
longlat=T))
all.linked <- max(unlist(nbdists(knn1,
tn_coords,
longlat=T)))
all.linked
## [1] 11.16466
We can repeat the same computation on municipalities centroids, discovering that every municipality has a neighbour at at least 11.1646609 km, slightly greater than the previous cut-off distance, which could mean that there are multiple schools in every municipality or that they are close enough to the boundary to be neighbour of other municipalities’ schools. Analyzing the distribution of distances between municipalities, we may notice that the majority of them distances from others from \(2\) to \(4\)km.
distances = unlist(nbdists(knn1,tn_coords,longlat=T))
ggplot()+
geom_histogram(aes(x=distances), fill='#F27059', bins=50)+
labs(title="Distribution of distances between schools")+
theme_minimal()
We can try different neighbourhood definitions for different values of the cut-off distance, starting from the minimum threshold found before (i.e. \(9.18\)km).
dnb10 <- dnearneigh(school_coords, 0, 10, longlat=TRUE); dnb10
## Neighbour list object:
## Number of regions: 724
## Number of nonzero links: 54326
## Percentage nonzero weights: 10.36408
## Average number of links: 75.03591
dnb15 <- dnearneigh(school_coords, 0, 15, longlat=TRUE); dnb15
## Neighbour list object:
## Number of regions: 724
## Number of nonzero links: 93048
## Percentage nonzero weights: 17.75129
## Average number of links: 128.5193
dnb20 <- dnearneigh(school_coords, 0, 20, longlat=TRUE); dnb20
## Neighbour list object:
## Number of regions: 724
## Number of nonzero links: 140056
## Percentage nonzero weights: 26.71927
## Average number of links: 193.4475
dnb25 <- dnearneigh(school_coords, 0, 25, longlat=TRUE); dnb25
## Neighbour list object:
## Number of regions: 724
## Number of nonzero links: 192010
## Percentage nonzero weights: 36.63083
## Average number of links: 265.2072
dnb30 <- dnearneigh(school_coords, 0, 30, longlat=TRUE); dnb30
## Neighbour list object:
## Number of regions: 724
## Number of nonzero links: 247346
## Percentage nonzero weights: 47.18759
## Average number of links: 341.6381
As the cut-off distance increases, the number of links grows rapidly. Based on the visualization, we could have stopped at 20, where nearly every school is connected to others.
plot_neighbour = function(model, coords, title){
par(mar=c(0,0,1,0))
plot(tn, border="grey",xlab="",ylab="",xlim=NULL)
title(main=title, cex.main=0.8)
plot(model, coords, add=TRUE, col="#F27059", pch=16, lwd = 1, points=FALSE)
}
par(mfrow = c(3,2))
plot_neighbour(dnb10, school_coords, "d nearest neighbours, d = 10")
plot_neighbour(dnb15, school_coords, "d nearest neighbours, d = 15")
plot_neighbour(dnb20, school_coords, "d nearest neighbours, d = 20")
plot_neighbour(dnb25, school_coords, "d nearest neighbours, d = 25")
plot_neighbour(dnb30, school_coords, "d nearest neighbours, d = 30")
The same approach could be applied to municipalities data, obviously creating a network based on the territories around a certain area. Remembering that the cut-off threshold in this case is above \(11\), we can start with \(12\).
dnb12 <- dnearneigh(tn_coords, 0, 12, longlat=TRUE); dnb12
## Neighbour list object:
## Number of regions: 166
## Number of nonzero links: 2000
## Percentage nonzero weights: 7.257947
## Average number of links: 12.04819
dnb16 <- dnearneigh(tn_coords, 0, 16, longlat=TRUE); dnb16
## Neighbour list object:
## Number of regions: 166
## Number of nonzero links: 3314
## Percentage nonzero weights: 12.02642
## Average number of links: 19.96386
dnb20 <- dnearneigh(tn_coords, 0, 20, longlat=TRUE); dnb20
## Neighbour list object:
## Number of regions: 166
## Number of nonzero links: 4866
## Percentage nonzero weights: 17.65859
## Average number of links: 29.31325
dnb24 <- dnearneigh(tn_coords, 0, 24, longlat=TRUE); dnb24
## Neighbour list object:
## Number of regions: 166
## Number of nonzero links: 6712
## Percentage nonzero weights: 24.35767
## Average number of links: 40.43373
dnb30 <- dnearneigh(tn_coords, 0, 30, longlat=TRUE); dnb30
## Neighbour list object:
## Number of regions: 166
## Number of nonzero links: 9652
## Percentage nonzero weights: 35.02685
## Average number of links: 58.14458
par(mfrow = c(3,2))
plot_neighbour(dnb12, tn_coords, "d nearest neighbours, d = 12")
plot_neighbour(dnb16, tn_coords, "d nearest neighbours, d = 16")
plot_neighbour(dnb20, tn_coords, "d nearest neighbours, d = 20")
plot_neighbour(dnb24, tn_coords, "d nearest neighbours, d = 24")
plot_neighbour(dnb30, tn_coords, "d nearest neighbours, d = 30")
Also in this case the number of connections grows rapidly, indicating how close the municipalities are between each other. Consider that the maximum distance within province of Trento between municipalities is around 120km (considering the centroids, therefore the actual distance is greater), but over the \(75\%\) of municipalities has an area below \(50km^2\), which allows them to be connected with brief distances.
# Quantiles of areas of municipalities within the Province of Trento
tn@data$area = round(area(tn)/ 1000000,3)
area = tn@data %>%
arrange(desc(area)) %>%
select(Comune, area)
quantile(area$area)
## 0% 25% 50% 75% 100%
## 1.65200 12.95100 25.57700 48.97075 199.55200
# Find max distance within the province centroids
library(geosphere)
## Warning: il pacchetto 'geosphere' è stato creato con R versione 4.1.2
diff = c()
for(i in 1:166 ){
for (j in 1:166){
diff = append(diff, distm(tn_coords[i,],tn_coords[j,], fun = distHaversine))
}
}
# Maximum distance between centroids within the province of Trento
max(diff)/1000
## [1] 120.7196
Since schools are represented as points, we will use municipalities data to connect territories with common boundary, i.e. multiple municipalities around. From the visualization it is worth noticing the spiderweb created around the municipalities on the inside of Trentino, while those more disconnected are placed on the border of the Province, especially the territories on the upper right part of the map (e.g. Canazei, San Giovanni di Fassa, Mazzin, Campitello di Fassa, Soraga di Fassa, Moena).
par(mar=c(0,0,0,0))
contnb_q <- poly2nb(tn, queen=T)
plot(tn, border="grey")
plot(contnb_q, tn_coords, add=TRUE, col="#EF798A")
points(coordinates(tn),
col="red",
bg = "#EF798A",
pch = 21,
lwd = 1.5)
⚠️ Note that there are 166 municipalities in the Province of Trento. By sorting them according to the shape area, we get that the biggest areas do not share at least one boundary, since they take place on the border or in mountainous zones; while Trento occupies a central position.
area%>%
DT::datatable()
After the definition of neighbourhoods, we can create spatial weights matrices, one for each neighbour list previously created.
# K-nearest neighbour
knn1.list = nb2listw(knn1)
knn2.list = nb2listw(knn2)
knn3.list = nb2listw(knn3)
knn4.list = nb2listw(knn4)
knn5.list = nb2listw(knn5)
# Critical cut-off
dnb12.list = nb2listw(dnb12,style="W")
dnb16.list = nb2listw(dnb16,style="W")
dnb20.list = nb2listw(dnb30,style="W")
dnb24.list = nb2listw(dnb24,style="W")
dnb30.list = nb2listw(dnb30,style="W")
# Contiguity based approach
contnb_q.list = nb2listw(contnb_q)
# List with weights lists and their name
weights = list(
list(knn1.list, "K-nearest neighbour (k=1)"),
list(knn2.list, "K-nearest neighbour (k=2)"),
list(knn3.list, "K-nearest neighbour (k=3)"),
list(knn4.list, "K-nearest neighbour (k=4)"),
list(knn5.list, "K-nearest neighbour (k=5)"),
list(dnb12.list, "Critical cut-off neighbourhood (d=12)"),
list(dnb16.list, "Critical cut-off neighbourhood (d=16)"),
list(dnb20.list, "Critical cut-off neighbourhood (d=20)"),
list(dnb24.list, "Critical cut-off neighbourhood (d=24)"),
list(dnb30.list, "Critical cut-off neighbourhood (d=30)"),
list(contnb_q.list, "Contiguity-based neighbourhoord")
)
In the section, we will focus on Moran’s I test of spatial autocorrelation of Trentino Schools, in particular on the number of schools and students that populate every municipality. Let’s start plotting the distribution of some features over the territory.
⚠️ Note that we do not hold data about students of all schools in Trentino, therefore some data might be missing. In these cases, the plots below will show the respective area in white.
cols = list(tn$Scuole.tot,tn$Studenti, tn$Classi, tn$Media.st_1, tn$Pop_stud.P, tn$Stud.Pop_s, tn$Media.stud)
titles = c("Schools","Students","Classes","Mean of students per school","Population under 20 over Total Population", "Students over Population under 20", "Mean students per class")
colours <- c("#fedb71","#FCD471","#facd71","#f6bf70","#eea26f","#EA946F","#e6866e","#e2786e","#e0716e","#dd696d")
na.ignore = function(x){
x[is.na(x)] <- -1
return(x)
}
par(mfrow=c(4,2),mar = c(0,0,1.7,0))
for(i in 1:7){
c = na.ignore(unlist(cols[i]))
brks <- round(quantile(c, seq(0,1,0.1)), digits=3)
plot(tn, col=ifelse(c==-1,
"#ffffff",
colours[findInterval(c, brks, all.inside=TRUE)]),
main = titles[i], cex.main=2.5)
}
Now we can try to compute the Moran’s test based on all the previous definitions of neighborhood and with the previous features exposed, trying to find some spatial autocorrelation.
Neighbourhood = c()
Column = c()
Sd = c()
p_value = c()
Moran_I_statistic = c()
Mean = c()
Var = c()
Assumption = c()
# Iterate over columns
for(i in 1:length(cols)){
# Iterate over neighbourhood
for (w in weights) {
c = na.zero(unlist(cols[i]))
# Iterate over assumptions
for (rand in c(T,F)) {
Neighbourhood = append(Neighbourhood, w[[2]])
res = moran.test(c, w[[1]], randomisation = rand)
Column = append(Column, titles[i])
Sd = append(Sd, round(as.numeric(res[[1]]),4))
p_value = append(p_value, round(as.numeric(res[[2]]), 4))
Moran_I_statistic = append(Moran_I_statistic, round(as.numeric(res[[3]][1]),4))
Mean = append(Mean, round(as.numeric(res[[3]][2]),4))
Var = append(Var, round(as.numeric(res[[3]][3]),4))
if(rand) {
Assumption = append(Assumption, "Randomization")
}else{
Assumption = append(Assumption, "Normality")
}
}
Neighbourhood = append(Neighbourhood, w[[2]])
res = moran.mc(c, w[[1]], nsim=999)
Column = append(Column, titles[i])
Sd = append(Sd,round(as.numeric(res[[1]]),4))
p_value = append(p_value, round(res$p.value),4)
Moran_I_statistic = append(Moran_I_statistic, round(res$statistic,4))
Mean = append(Mean, "")
Var = append(Var, "")
Assumption = append(Assumption, res$method)
}
}
# create df with results and show them
res_df = data.frame(Column, Neighbourhood, Moran_I_statistic, p_value,
Sd, Mean, Var, Assumption)
res_df %>%
arrange(desc(abs(Moran_I_statistic)), p_value) %>%
DT::datatable()
By ordering results according to the absolute value of the Moran’s I statistics, we can see that the highest statistics obtained are around the \([0.1789, 0.1952]\) interval, got in all three assumptions. Albeit mean students per school seem to be the column with highest Moran’s statistics, the p-value is not below the threshold of \(0.05\) for the majority of its observations. In fact, the mean of students per school results significative with neighbourhoods knn (k=5) and critical cut-off with d=16.
On the other hand, the proportion of Population under 20 over the total population seems significative with every configuration, except for some knns neighbourhoods. However, Moran’s statistics is lower than the ones gathered considering the mean of students per school.
Both these two columns show a positive spatial autocorrelation, while negative ones are associated to the number of Students over the Population under 20, with low p-value and minimum value \(-0.1296\) for Moran’s I statistics.
res_df%>%
group_by(Column) %>%
summarise("Median Moran" = median(Moran_I_statistic), "Median p-value" = median(p_value),
"Mean Moran" = round(mean(Moran_I_statistic),4), "Mean p-value" = round(mean(p_value),4)) %>%
DT::datatable()
Considering an aggregated table with median and mean statistics and p-value, we can confirm that the columns with highest statistics are:
The low mean and median p-values for Students confirms the absence of spatial autocorrelation based on the number of students. Nearly the same happens to the number of Classes and Schools.
Let’s start by trying to model the mean students per school with all the remaining features we have explored. The summary shows that all predictors are significant, except for the population under 20 over total population.
LinearMean <- lm(tn$Media.stud ~ Stud.Pop_s+Scuole.tot+Classi+Media.st_1+Pop_stud.P+Studenti, tn)
summary(LinearMean)
##
## Call:
## lm(formula = tn$Media.stud ~ Stud.Pop_s + Scuole.tot + Classi +
## Media.st_1 + Pop_stud.P + Studenti, data = tn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.440 -1.101 0.160 1.049 4.434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.258131 1.617090 5.107 1.20e-06 ***
## Stud.Pop_s 1.939540 0.578944 3.350 0.00107 **
## Scuole.tot 0.590475 0.113650 5.196 8.13e-07 ***
## Classi -0.200131 0.044154 -4.533 1.35e-05 ***
## Media.st_1 0.039774 0.003293 12.079 < 2e-16 ***
## Pop_stud.P 5.294348 8.533140 0.620 0.53610
## Studenti 0.006696 0.002175 3.079 0.00256 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.787 on 124 degrees of freedom
## (35 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared: 0.7054, Adjusted R-squared: 0.6912
## F-statistic: 49.49 on 6 and 124 DF, p-value: < 2.2e-16
With the step function it is possible to simplify this model considering only those features with high significance, excluding the features with no significance (i.e. Pop_stud.P).
# Searching for a simplified model where every feature has high significance
LinearMean = step(LinearMean)
## Start: AIC=158.97
## tn$Media.stud ~ Stud.Pop_s + Scuole.tot + Classi + Media.st_1 +
## Pop_stud.P + Studenti
##
## Df Sum of Sq RSS AIC
## - Pop_stud.P 1 1.23 397.40 157.38
## <none> 396.17 158.97
## - Studenti 1 30.29 426.45 166.62
## - Stud.Pop_s 1 35.86 432.03 168.32
## - Classi 1 65.64 461.80 177.05
## - Scuole.tot 1 86.24 482.41 182.77
## - Media.st_1 1 466.11 862.28 258.85
##
## Step: AIC=157.38
## tn$Media.stud ~ Stud.Pop_s + Scuole.tot + Classi + Media.st_1 +
## Studenti
##
## Df Sum of Sq RSS AIC
## <none> 397.40 157.38
## - Studenti 1 29.26 426.66 164.68
## - Stud.Pop_s 1 34.66 432.06 166.33
## - Classi 1 64.61 462.01 175.11
## - Scuole.tot 1 89.16 486.56 181.89
## - Media.st_1 1 591.27 988.66 274.77
summary(LinearMean)
##
## Call:
## lm(formula = tn$Media.stud ~ Stud.Pop_s + Scuole.tot + Classi +
## Media.st_1 + Studenti, data = tn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2233 -1.0388 0.1256 1.0652 4.3898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.228110 0.412372 22.378 < 2e-16 ***
## Stud.Pop_s 1.884257 0.570637 3.302 0.00125 **
## Scuole.tot 0.597451 0.112814 5.296 5.16e-07 ***
## Classi -0.197897 0.043899 -4.508 1.49e-05 ***
## Media.st_1 0.040634 0.002980 13.637 < 2e-16 ***
## Studenti 0.006533 0.002154 3.034 0.00294 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.783 on 125 degrees of freedom
## (35 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared: 0.7045, Adjusted R-squared: 0.6927
## F-statistic: 59.6 on 5 and 125 DF, p-value: < 2.2e-16
The plot of the studentized residuals of the linear model can give us a hint about the presence of spatial dependence in the residuals. In fact, some similarities may be found in the east part, near the border and within the Adamello Brenta Park.
par(mar=c(0,0,1,0))
studres <- rstudent(LinearMean)
resdistr <- round(quantile(studres, seq(0,1,0.1)), digits=3)
#colours <- grey((length(resdistr):2)/length(resdistr))
plot(tn, col=colours[findInterval(studres, resdistr, all.inside=TRUE)],
main = "Residuals quantiles in Trentino")
The command that allows to perform the Moran’s I test in the OLS residuals is the function lm.morantest(). In the following chunk, the test to the studentized residuals of the linear Solow model, for different specifications of the spatial weights matrix. This method will be applied to all neighbourhoods definition, except KNN with \(k\) lower than 4 and the contiguity neighbourhood, since they return unusual results.
ols_res = data.frame(Neighbourhood = c(""),
Moran = c(""),
p_value = c(""))
# Moran test on residuals
for(i in 4:11){
t = lm.morantest(LinearMean,weights[[i]][[1]],resfun=rstudent)
ols_res = rbind(ols_res, c(weights[[i]][[2]],
t$estimate['Observed Moran I'],
t$p.value))
}
ols_res$Moran = as.numeric(ols_res$Moran)
ols_res$p_value = as.numeric(ols_res$p_value)
ols_res %>%
arrange(desc(abs(Moran))) %>%
filter(!is.na(Moran)) %>%
DT::datatable()
The obtained results provide a negative spatial autocorrelation, but the p-value is far from being below to the threshold to confirm this hypothesis. Therefore, no spatial autocorrelation is found in the residuals of this model.
By focusing instead on the population under 20 over the total population for each municipality, we obtain from the summary that only the mean students per class is significative.
LinearPop <- lm(tn$Pop_stud.P ~ Stud.Pop_s+Scuole.tot+Classi+Media.st_1+Media.stud+Studenti, tn)
summary(LinearPop)
##
## Call:
## lm(formula = tn$Pop_stud.P ~ Stud.Pop_s + Scuole.tot + Classi +
## Media.st_1 + Media.stud + Studenti, data = tn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.059060 -0.011379 0.002452 0.010725 0.045638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.778e-01 9.719e-03 18.296 <2e-16 ***
## Stud.Pop_s -1.154e-02 6.268e-03 -1.842 0.0679 .
## Scuole.tot 9.683e-04 1.315e-03 0.736 0.4629
## Classi 5.377e-04 4.986e-04 1.078 0.2829
## Media.st_1 1.387e-04 4.950e-05 2.802 0.0059 **
## Media.stud 5.846e-04 9.422e-04 0.620 0.5361
## Studenti -3.461e-05 2.351e-05 -1.472 0.1435
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01878 on 124 degrees of freedom
## (35 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared: 0.184, Adjusted R-squared: 0.1445
## F-statistic: 4.659 on 6 and 124 DF, p-value: 0.0002638
By applying the step function, the mean of students per school and the number of classes are erased. However, the adjusted r-squared is too low to guarantee the quality of this model.
# Searching for a simplified model where every feature has high significance
LinearPop = step(LinearPop)
## Start: AIC=-1034.61
## tn$Pop_stud.P ~ Stud.Pop_s + Scuole.tot + Classi + Media.st_1 +
## Media.stud + Studenti
##
## Df Sum of Sq RSS AIC
## - Media.stud 1 0.00013579 0.043877 -1036.2
## - Scuole.tot 1 0.00019128 0.043933 -1036.0
## - Classi 1 0.00041026 0.044152 -1035.4
## <none> 0.043742 -1034.6
## - Studenti 1 0.00076467 0.044506 -1034.3
## - Stud.Pop_s 1 0.00119657 0.044938 -1033.1
## - Media.st_1 1 0.00276903 0.046511 -1028.6
##
## Step: AIC=-1036.2
## tn$Pop_stud.P ~ Stud.Pop_s + Scuole.tot + Classi + Media.st_1 +
## Studenti
##
## Df Sum of Sq RSS AIC
## - Classi 1 0.0002938 0.044171 -1037.3
## - Scuole.tot 1 0.0004336 0.044311 -1036.9
## - Studenti 1 0.0006498 0.044527 -1036.3
## <none> 0.043877 -1036.2
## - Stud.Pop_s 1 0.0010645 0.044942 -1035.1
## - Media.st_1 1 0.0094505 0.053328 -1012.6
##
## Step: AIC=-1037.33
## tn$Pop_stud.P ~ Stud.Pop_s + Scuole.tot + Media.st_1 + Studenti
##
## Df Sum of Sq RSS AIC
## <none> 0.044171 -1037.3
## - Scuole.tot 1 0.0007497 0.044921 -1037.1
## - Stud.Pop_s 1 0.0007708 0.044942 -1037.1
## - Studenti 1 0.0009575 0.045129 -1036.5
## - Media.st_1 1 0.0092138 0.053385 -1014.5
summary(LinearPop)
##
## Call:
## lm(formula = tn$Pop_stud.P ~ Stud.Pop_s + Scuole.tot + Media.st_1 +
## Studenti, data = tn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.059266 -0.011806 0.002099 0.010582 0.046054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.826e-01 4.277e-03 42.690 < 2e-16 ***
## Stud.Pop_s -7.607e-03 5.130e-03 -1.483 0.141
## Scuole.tot 1.649e-03 1.128e-03 1.462 0.146
## Media.st_1 1.596e-04 3.113e-05 5.127 1.08e-06 ***
## Studenti -1.100e-05 6.658e-06 -1.653 0.101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01872 on 126 degrees of freedom
## (35 osservazioni eliminate a causa di valori mancanti)
## Multiple R-squared: 0.1759, Adjusted R-squared: 0.1498
## F-statistic: 6.725 on 4 and 126 DF, p-value: 6.134e-05
ols_res = data.frame(Neighbourhood = c(""),
Moran = c(""),
p_value = c(""))
# Moran test on residuals
for(i in 4:11){
t = lm.morantest(LinearPop,weights[[i]][[1]],resfun=rstudent)
ols_res = rbind(ols_res, c(weights[[i]][[2]],
t$estimate['Observed Moran I'],
t$p.value))
}
ols_res$Moran = as.numeric(ols_res$Moran)
ols_res$p_value = as.numeric(ols_res$p_value)
ols_res %>%
arrange(desc(abs(Moran))) %>%
filter(!is.na(Moran)) %>%
DT::datatable()
In this case instead, we can notice how low the p-value is, sometimes even below the 0.05 threshold. This may indicate the presence of spatial autocorrelation in the residuals, especially with knn k=5 and consequently that the model is misspecified. Equivalent results have been obtained by rotating and discarding some predictors.
If spatial autocorrelation is present it will violate the assumption about the independence of residuals and call into question the validity of hypothesis testing6, leading to the rejection of the Population under 20 over the total population.
par(mar=c(0,0,1,0))
studres <- rstudent(LinearPop)
resdistr <- round(quantile(studres, seq(0,1,0.1)), digits=3)
#colours <- grey((length(resdistr):2)/length(resdistr))
plot(tn, col=colours[findInterval(studres, resdistr, all.inside=TRUE)],
main = "Residuals quantiles in Trentino")
The following grid of images shows the Moran’s Scatterplot within all different notions of neighborhoods. The membership of municipalities inside a quadrant or another changes according to the neighbourhood definition. For instance, Novaledo often switches from HH to HL quadrant and Valfloriana from LH to LL and viceversa. Also, since this feature is a proportion that goes from \(0.10\) to \(0.24\), many municipalities overlap or assume the same value (columns of dots). Nevertheless, some outliers are noticeable, represented with a different style and their label.
As stated forehead, the quadrants with the pink line and its confidence interval shows the municipalities with similar proportion of students population over the total one (i.e. positive spatial autocorrelation). An example is the relationship between Novaledo and Vignola-Falesina, which are usually in the same HH quadrant, have a high proportion of students and they are close, geographically talking. On the opposite side, Cinte Tesino and Castello Tesino show a low value for students over the population.
On the other hand, municipalities on the remaining quadrants, have dissimilar values if they belong to opposite quadrants. For example, Fierozzo and Frassilongo share a boundary, but their proportion of students is 0.22 and 0.15 respectively, enhancing a big dissimilarity between these two municipalities.
This chunk shows how many municipalities belong to a specific quadrant. It seems like only 60 over 166 municipalities have a spatial autocorrelation.
# PLOTTING REGIONS OF INFLUENCE ABOUT STUDENTS OVER POPULATION UNDER 20
color_mapping = list("LL" = "#FEDB71",
"LH" = "#F6Bf70",
"HL" = "#E6866E",
"HH" = "#E0716E",
"None" = "white")
hotspot <- as.numeric(row.names(as.data.frame(summary(mps[1][[1]]))))
tn$Stud.Pop_s = na.zero(tn$Stud.Pop_s)
tn$wx <- lag.listw(dnb12.list, tn$Stud.Pop_s)
tn$quadrant <- rep("None", length(tn$Stud.Pop_s))
for(i in 1:length(hotspot)) {
if (tn$Stud.Pop_s[hotspot[i]]>mean(tn$Stud.Pop_s) & tn$wx[hotspot[i]]> mean(tn$wx))
tn$quadrant[hotspot[i]] <- "HH"
if (tn$Stud.Pop_s[hotspot[i]]>mean(tn$Stud.Pop_s) & tn$wx[hotspot[i]]< mean(tn$wx))
tn$quadrant[hotspot[i]] <- "HL"
if (tn$Stud.Pop_s[hotspot[i]]<mean(tn$Stud.Pop_s) & tn$wx[hotspot[i]]<mean(tn$wx))
tn$quadrant[hotspot[i]] <- "LL"
if (tn$Stud.Pop_s[hotspot[i]]<mean(tn$Stud.Pop_s) & tn$wx[hotspot[i]]>mean(tn$wx))
tn$quadrant[hotspot[i]] <- "LH"
}
table(tn$quadrant)
##
## HH HL LH LL None
## 5 14 20 21 106
In order to better visualize them, the plot is served, showing in white the municipalities with no quadrant and the one inside the Moran’s quadrants with a gradient.
⚠️ It is suggested to open the image in a new window to better read municipalities labels, whose size has been reduced for readability issues.
tn$colours = unlist(color_mapping[tn$quadrant])
par(mar=c(0,0,1,0))
plot(tn, col=tn$colours, main="Regions with influence on students over population under 20")
legend(x=11.38, y=45.95, legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
fill=unlist(color_mapping), bty="n", cex=0.8)
text(tn_coords, labels = ifelse(tn$quadrant=="None", "" ,tn@data$Comune), cex=0.5)
The following grid of images shows the Moran’s Scatterplot within all different notions of neighborhoods, considering the logarithmic scale. The main difference between
# PLOTTING REGIONS OF INFLUENCE ABOUT MEAN OF STUDENTS PER CLASS
color_mapping = list("LL" = "#FEDB71",
"LH" = "#F6Bf70",
"HL" = "#E6866E",
"HH" = "#E0716E",
"None" = "white")
hotspot <- as.numeric(row.names(as.data.frame(summary(mps[1][[1]]))))
tn$Media.st_1 = na.zero(tn$Media.st_1)
tn$wx <- lag.listw(dnb12.list, tn$Media.st_1)
tn$quadrant <- rep("None", length(tn$Media.st_1))
for(i in 1:length(hotspot)) {
if (tn$Media.st_1[hotspot[i]]>mean(tn$Media.st_1) & tn$wx[hotspot[i]]> mean(tn$wx))
tn$quadrant[hotspot[i]] <- "HH"
if (tn$Media.st_1[hotspot[i]]>mean(tn$Media.st_1) & tn$wx[hotspot[i]]< mean(tn$wx))
tn$quadrant[hotspot[i]] <- "HL"
if (tn$Media.st_1[hotspot[i]]<mean(tn$Media.st_1) & tn$wx[hotspot[i]]<mean(tn$wx))
tn$quadrant[hotspot[i]] <- "LL"
if (tn$Media.st_1[hotspot[i]]<mean(tn$Media.st_1) & tn$wx[hotspot[i]]>mean(tn$wx))
tn$quadrant[hotspot[i]] <- "LH"
}
table(tn$quadrant)
##
## HH HL LH LL None
## 18 6 12 24 106
tn$colours = unlist(color_mapping[tn$quadrant])
par(mar=c(0,0,1,0))
plot(tn, col=tn$colours, main="Regions with influence on mean students per class")
legend(x=11.38, y=45.95, legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
fill=unlist(color_mapping), bty="n", cex=0.8)
text(tn_coords, labels = ifelse(tn$quadrant=="None", "" ,tn@data$Comune), cex=0.5)
This last inquiry is due to the great gap discovered in the choropleth map inside the website that shows the proportion of actual students over the total population under 20 years old of every municipality. It seems in fact that some municipalities host more students than those who actually live in that area, leading to the possible conclusion that students may need to move from their city to another to go to school every day.
The following grid of images shows the Moran’s Scatterplot within all different notions of neighborhoods, considering the logarithmic scale. The main difference between
# PLOTTING REGIONS OF INFLUENCE ABOUT MEAN OF STUDENTS PER CLASS
color_mapping = list("LL" = "#FEDB71",
"LH" = "#F6Bf70",
"HL" = "#E6866E",
"HH" = "#E0716E",
"None" = "white")
hotspot <- as.numeric(row.names(as.data.frame(summary(mps[1][[1]]))))
tn$Media.st_1 = na.zero(tn$Media.st_1)
tn$wx <- lag.listw(dnb12.list, tn$Media.st_1)
tn$quadrant <- rep("None", length(tn$Media.st_1))
for(i in 1:length(hotspot)) {
if (tn$Media.st_1[hotspot[i]]>mean(tn$Media.st_1) & tn$wx[hotspot[i]]> mean(tn$wx))
tn$quadrant[hotspot[i]] <- "HH"
if (tn$Media.st_1[hotspot[i]]>mean(tn$Media.st_1) & tn$wx[hotspot[i]]< mean(tn$wx))
tn$quadrant[hotspot[i]] <- "HL"
if (tn$Media.st_1[hotspot[i]]<mean(tn$Media.st_1) & tn$wx[hotspot[i]]<mean(tn$wx))
tn$quadrant[hotspot[i]] <- "LL"
if (tn$Media.st_1[hotspot[i]]<mean(tn$Media.st_1) & tn$wx[hotspot[i]]>mean(tn$wx))
tn$quadrant[hotspot[i]] <- "LH"
}
table(tn$quadrant)
##
## HH HL LH LL None
## 18 6 12 24 106
tn$colours = unlist(color_mapping[tn$quadrant])
par(mar=c(0,0,1,0))
plot(tn, col=tn$colours, main="Regions with influence on mean students per class")
legend(x=11.38, y=45.95, legend=c("Low-Low", "Low-High", "High-Low", "High-High","None"),
fill=unlist(color_mapping), bty="n", cex=0.8)
text(tn_coords, labels = ifelse(tn$quadrant=="None", "" ,tn@data$Comune), cex=0.5)
Tobler, Waldo R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46 (2): 234–40. http://www.geog.ucsb.edu/~tobler/publications/pdf_docs/A-Computer-Movie.pdf.↩︎
Wikipedia page of Centroid, https://en.wikipedia.org/wiki/Centroid↩︎
How to find the centre of a polygon i Python, https://deparkes.co.uk/2015/02/28/how-to-find-the-centre-of-a-polygon-in-python/↩︎
saveGIF() documentation https://www.rdocumentation.org/packages/animation/versions/2.4.1/topics/saveGIF↩︎
Spatial regression models documentation, https://rspatial.org/raster/analysis/7-spregression.html↩︎
Spatial Autocorrelation, https://ibis.geog.ubc.ca/courses/geob479/notes/spatial_analysis/spatial_autocorrelation.htm#↩︎